Abstract

Minimal-interval semantics [5] associates with each query over a document a set of
intervals, called witnesses, that are incomparable with respect to inclusion (i.e., they
form an antichain): witnesses define the minimal regions of the document satisfying the
query. Minimal-interval semantics makes it easy to define and compute several sophisticated
proximity operators, provides snippets for user presentation, and can be used
to rank documents. In this paper we provide algorithms for computing conjunction
and disjunction that are linear in the number of intervals and logarithmic in the number
of operands; for additional operators, such as ordered conjunction and Brouwerian
difference, we provide linear algorithms. In all cases, space is linear in the number of
operands. More importantly, we define a formal notion of optimal laziness, and either
prove it, or prove its impossibility, for each algorithm. Optimal laziness implies that the
algorithms do not assume random access to the input intervals, and read as little input
as possible to produce a certain output. We cast our results in the general framework
of antichain completions of interval orders, making our algorithms directly applicable to
other domains.

1 Introduction

Search engines are a popular way to retrieve information on the web. However, the classical
problem studied by the theory of information retrieval, that of answering a query by returning
the set of documents that match the information provided by the user, is complicated by the
huge number of documents to be taken into consideration. On the web, retrieving many
relevant documents is usually not a problem: there are simply too many documents already.
Rather than recall, precision (in particular, precision in the first 10-20 results) is the main
issue.

A first possibility for extending the user capabilities is query expansion, an automatic or
semi-automatic mechanism that aims at enriching a given query, for example by using some
semantics extracted from the context, or by asking the user directly for the intended
meaning of the query. In this case, we start from a very simple query, perhaps expressed
in some natural language, and finally produce a richer (hopefully, more specific) query that
is to be submitted to the search engine.

A different, complementary approach is that of providing the user with more powerful (but 
understandable) operators, which however requires departing from the Boolean model. In this
paper we pursue this path, focusing on minimal-interval semantics, a semantic model that 
uses antichains of intervals of natural numbers to represent the semantics of a query; this is 
the natural framework in which operators such as ordered conjunction, proximity restriction, 
etc., can be defined and combined freely. Each interval is a witness of the satisfiability of the 
query, and defines a region of the document that the query satisfies (words in the document 
are numbered starting from 0, so regions of text are identified with intervals of integers). For 
instance, a query formed by the conjunction of two terms is satisfied by the minimal intervals 
of the document containing both terms. 

This approach has been defined and studied extensively by Clarke, Cormack and
Burkowski in their seminal paper [5]. They showed that antichains have a natural lattice 
structure that can be used to interpret conjunctions and disjunctions in queries. Moreover, it 
is possible to define several additional operators (proximity, followed-by, and so on) directly 
on the antichains. The authors have also described families of successful ranking schemes 
based on the number and length of the intervals involved [4]. 

The main feature of minimal-interval semantics is that, by its very definition, an antichain 
of intervals cannot contain more than w intervals, where w is the number of words in the 
document. Thus, it is in principle possible to compute all minimal-interval operators in time 
linear in the document size. This is not true, for instance, if we consider different interval- 
semantics approaches in which all intervals are retained and indexed (e.g., the PAT system [7] 
or the sgrep tool [10]), as the overall number of regions is quadratic in the document size. 

In this paper, we attack the problem of providing efficient algorithms for the computation
of such operators. As a subproblem, we can compute the proximity of a set of terms,
and indeed we are partly inspired by previous work on proximity [16, 14]. Our algorithms
are linear in the number of input intervals. For conjunction and disjunction, there is also a
multiplicative logarithmic factor in the number of input antichains, which however can be
shown to be essentially unavoidable in the disjunctive case. The space used by all algorithms
is linear in the number of input antichains (in fact, we need to store just one interval per
antichain), so they are a very extreme case of stream transformation algorithms [1, 9].
Moreover, our algorithms are (with one exception, for which we prove an impossibility result)
optimally lazy, that is, while building their results they do not advance the input lists more
than necessary. 1

We believe that the existence of (almost) linear lazy algorithms for minimal-interval
semantics makes it the natural candidate for advancing web search engines beyond a purely
Boolean model: in particular, the possibility of limiting the interval width has a very
natural interpretation for the user in terms of proximity, and ordered conjunction has obvious
applications.

Minimal intervals can also be used together with other standard information-retrieval 
techniques. For instance, the Indri search engine [15] expands a query into a number of 
subqueries, many of which are interval-based, and combines the results. 

In Section 2 we briefly introduce minimal-interval semantics, referring to the original 
paper for examples and motivations. The presentation is rather algebraic, and uses standard 
terms from mathematics and order theory (e.g., "interval" instead of "extent" as in [5]). The 
resulting structure is essentially identical to that described in the original paper, but our 
systematic approach makes good use of well-known results from order theory, making the 
introduction self-contained. For some mathematical background, see, for instance, Birkhoff's 
classic [2]. 

Another advantage of our approach is that by representing abstractly regions of text as 
intervals of natural numbers we can easily highlight connections with other areas of computer 
science: antichains of intervals have been used for role-based access control [6], and for testing 
distributed computations [11]. The problem of computing operators on antichains has thus 
an intrinsic interest that goes beyond the problems of information retrieval. This is the reason 
why we cast all our results in the general framework of antichain completion of intervals on 
arbitrary (totally) ordered finite sets. 

Finally, we present our algorithms. First we discuss algorithms based on queues, and then 
greedy algorithms. 2 

2 Minimal-interval semantics 

Given a finite totally ordered set $O$, let us denote with $\mathcal{I}_O$ the set of intervals of $O$ (a subset
$X$ of $O$ is an interval if $x, y \in X$ and $x < z < y$ imply $z \in X$; note that $\emptyset \in \mathcal{I}_O$) ordered
by inclusion. Our working example will always be $w = \{0, 1, \dots, w-1\}$, where $w$ represents
the number of words in a document, numbered starting from 0 (see Figure 1); elements of
$\mathcal{I}_w$ can be thought of as regions of text.

Given intervals $I$ and $J$, the interval spanned by $I$ and $J$ is the least interval containing
$I$ and $J$ (in fact, their least upper bound in $\mathcal{I}_O$). Nonempty intervals will be denoted by
$[\ell..r]$, where $\ell$ is the left extreme and $r$ is the right extreme (i.e., the smallest and largest
element in the interval). Intervals are ordered by containment: when we want to order them
by reverse containment instead, we shall write $\mathcal{I}_O^{\mathrm{op}}$ ("op" stands for "opposite").

The idea behind minimal-interval semantics [5] is that every interval in $\mathcal{I}_w$ is a witness
that a given query is satisfied by a document made of w words. Smaller witnesses imply a 

1 In fact, the algorithms presented here differ significantly from those presented in [3] precisely because of 
the quest for optimal laziness. 

2 A free implementation of all algorithms described in this paper is available as a part of MG4J 
(http://mg4j.dsi.unimi.it/).
Pease porridge hot! Pease porridge cold! Pease porridge in the pot nine days old! Some like it hot, some
like it cold, Some like it in the pot nine days old! Pease porridge hot! Pease porridge cold!
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 

Figure 1: A sample text; the intervals corresponding to the semantics of the query "(hot 
OR cold) AND porridge AND pease" are shown. For easier reading, every other interval is 
dashed. 

better match, or more information; in particular, if an interval is a witness any containing 
interval is a witness. We also expect that more witnesses imply more information. Thus, 
when expressing the semantics of a query, we discard non-minimal intervals, as there are 
intervals that provide more relevant information. As a result, minimal-interval semantics 
associates with each query an antichain 3 of intervals. For instance, in Figure 1 we see a 
short passage of text, and the antichain of intervals corresponding to a query. Note that, for 
instance, the interval [0..3] is not included because it is not minimal.

It is however more convenient to start from an algebraic viewpoint. An order ideal $X$
(henceforth called just an ideal) is a subset of a partial order that is closed downwards: if
$y \le x$ and $x \in X$, then $y \in X$. The ideal completion of an order $P$ is a distributive lattice
whose elements are the ideals of $P$ ordered by inclusion. We are interested in computing
operators on the ideal completion of $\mathcal{I}_O^{\mathrm{op}}$, which will be the base of our semantics:

$$\mathcal{E}_O = \{\, X \subseteq \mathcal{I}_O^{\mathrm{op}} \mid X \text{ is an ideal} \,\}.$$

It is known that an ideal over a finite partial order is uniquely represented by the antichain of 
its maximal elements. Intuitively, the antichain of maximal elements is the "upper border" of 
the ideal. Because of this bijection, antichains of intervals are endowed with a partial order, 
and with the algebraic structure of a distributive lattice, which turns out to be a very handy 
representation of $\mathcal{E}_O$.

The lattice of antichains $\mathcal{E}_w$ thus defined is essentially the classic Clarke-Cormack-
Burkowski minimal-interval lattice, with the important difference that since we allow the 
empty interval, we have a top element that has the empty interval only as a witness. For the 
purposes of this paper, the difference is immaterial, though. 

To make the reader grasp more easily the meaning of $\mathcal{E}_O$, we now describe in an elementary
way its order and its lattice operations (note that we are not giving a definition: the operations
are simply the reflection on the set of antichains of those of $\mathcal{E}_O$). Given antichains $A$ and $B$,
we have

$$A \le B \iff \forall I \in A \;\; \exists J \in B \;\; J \subseteq I.$$

Intuitively, $A \le B$ if every witness $I$ in $A$ (an interval) can be substituted by a better (or
equal) witness $J$ in $B$, where "better" means that the new witness $J$ is contained in $I$.

Correspondingly, the $\vee$ of two antichains $A$ and $B$ is given by the union of the intervals
in $A$ and $B$ from which non-minimal intervals have been eliminated. Finally, the $\wedge$ of $A$ and
$B$ is given by the set of all intervals spanned by a pair of intervals $I \in A$ and $J \in B$, from
which non-minimal intervals have been eliminated. It is this very natural algebraic structure
that has led to the definition of the Clarke-Cormack-Burkowski lattice.
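
The order and the two lattice operations just described can be sketched by brute force. The following is only an illustration, not one of the algorithms of this paper: it assumes that nonempty intervals are represented as hypothetical `(left, right)` tuples of integers, and the names `minimal`, `leq`, `join` and `meet` are ours.

```python
from itertools import product

def minimal(intervals):
    """Keep only the intervals that contain no other interval of the set."""
    s = set(intervals)
    return sorted(i for i in s
                  if not any(j != i and i[0] <= j[0] and j[1] <= i[1] for j in s))

def leq(a, b):
    """A <= B: every witness in A contains some witness of B."""
    return all(any(i[0] <= j[0] and j[1] <= i[1] for j in b) for i in a)

def join(a, b):
    """Lattice join: minimal intervals of the union."""
    return minimal(list(a) + list(b))

def meet(a, b):
    """Lattice meet: minimal intervals among those spanned by a pair."""
    return minimal((min(i[0], j[0]), max(i[1], j[1])) for i, j in product(a, b))

a = [(0, 0), (3, 3)]
b = [(1, 1), (4, 4)]
print(meet(a, b))   # [(0, 1), (1, 3), (3, 4)]
print(join(a, b))   # [(0, 0), (1, 1), (3, 3), (4, 4)]
print(leq(meet(a, b), a), leq(a, join(a, b)))   # True True
```

The quadratic filtering in `minimal` is chosen for clarity only; the point of the rest of the paper is precisely to avoid this kind of eager computation.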

For instance, consider from Figure 1 the positions of "porridge" (1,4,7,32,35), "pease" 
(0,3,6,31,34) and "hot OR cold" (2,5,17,21,33,36), seen as sets of singleton intervals; by picking 

3 An antichain of a partial order is a subset of elements that are pairwise incomparable. 
one interval from each of the three sets, we generate a large number of spanned intervals, but
the minimal ones are just

{ [0..2], [1..3], [2..4], [3..5], [4..6], [5..7], [6..17], [7..31],
[21..32], [31..33], [32..34], [33..35], [34..36] }.

A simple snippet extraction algorithm would compute greedily the first k smallest nonoverlapping
intervals of the antichain, which would yield, for k = 3, the intervals [0..2], [3..5],
[31..33], that is, "Pease porridge hot!", "Pease porridge cold!", and, again, "Pease porridge
hot!". A ranking scheme such as those proposed in [4] would use the number and the length 
of these intervals to assign a score to the document with respect to the query. 
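
The worked example above can be reproduced with a few lines of brute-force code. This is a sketch under our own assumptions (intervals as `(left, right)` tuples, the helper names `minimal` and `snippets`, and ties between equally long intervals broken by left extreme); it enumerates all spanned intervals eagerly, which the algorithms of this paper are designed to avoid.

```python
from itertools import product

# Term positions taken from Figure 1.
pease = [0, 3, 6, 31, 34]
porridge = [1, 4, 7, 32, 35]
hot_or_cold = [2, 5, 17, 21, 33, 36]

def minimal(intervals):
    """Keep only the intervals that contain no other interval of the set."""
    s = set(intervals)
    return sorted(i for i in s
                  if not any(j != i and i[0] <= j[0] and j[1] <= i[1] for j in s))

# Every choice of one position per term spans an interval; keep the minimal ones.
spanned = {(min(t), max(t)) for t in product(pease, porridge, hot_or_cold)}
witnesses = minimal(spanned)

def snippets(antichain, k):
    """Greedily pick the k shortest pairwise nonoverlapping intervals."""
    chosen = []
    for l, r in sorted(antichain, key=lambda iv: (iv[1] - iv[0], iv[0])):
        if len(chosen) == k:
            break
        if all(r < cl or cr < l for cl, cr in chosen):
            chosen.append((l, r))
    return sorted(chosen)

print(len(witnesses))          # 13
print(snippets(witnesses, 3))  # [(0, 2), (3, 5), (31, 33)]
```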

Finally, we remark that the intervals in an antichain can be ordered in principle either by 
left or by right extreme, but these orders can be easily shown to be the same, so we can say 
that the intervals in an antichain are naturally linearly ordered by their extremes. 

3 Operators 

For the rest of the paper, we assume that we are operating on antichains based on an unknown 
total order O for which we just have a comparison operator. We use $\pm\infty$ to denote a special
element that is strictly smaller/larger than all elements in O. Before getting to the core of 
the paper, however, we highlight the connection with query resolution in a search engine. 

Search engines use inverted lists to index their document collections [19]. The algorithms 
described in this paper assume that, besides the documents in which a term appears, the
index makes available the positions of all occurrences of a term in increasing order (this is a 
standard assumption, as it is necessary to perform gap-encoding). 

Given a query (that we shall not define formally: the syntax is implied by our choice 
of operators), we first obtain the list of documents that could possibly satisfy the query; 
this is a routine process that involves merging and intersecting lists. Once we know that a 
certain document might satisfy the query, we want to find its witnesses, if any. To do so, we 
interpret the terms appearing in the query as lists of singleton intervals (the term positions), 
and apply in turn each operator appearing in the query. The resulting antichain represents 
the minimal-interval semantics (i.e., the set of witnesses) of the query with respect to the 
document. 

For completeness, we define explicitly the operators 4 AND and OR, which are applied
to a list of input antichains $A_0, A_1, \dots, A_{m-1}$, resulting in the $\wedge$ and $\vee$, respectively, of
the antichains $A_0, A_1, \dots, A_{m-1}$. Besides, we consider other useful operators that can be
defined directly on the antichain representation [5]. With this aim, let us introduce a relation
$\ll$ between intervals: $I \ll J$ iff $x < y$ for all $x \in I$ and $y \in J$.

1. ("disjunction operator") OR, given input antichains $A_0, A_1, \dots, A_{m-1}$, returns the set
of minimal intervals among those in $A_0 \cup A_1 \cup \dots \cup A_{m-1}$.

2. ("conjunction operator") AND, given input antichains $A_0, A_1, \dots, A_{m-1}$, returns the
set of minimal intervals among those spanned by the tuples in $A_0 \times A_1 \times \dots \times A_{m-1}$.

3. ("phrasal operator") BLOCK, given input antichains $A_0, A_1, \dots, A_{m-1}$, returns the set
of intervals of the form $I_0 \cup I_1 \cup \dots \cup I_{m-1}$ with $I_i \in A_i$ ($0 \le i < m$) and $I_{i-1} \ll I_i$
($0 < i < m$).

4 The reader might be slightly confused by the fact that we are using $\wedge$ and AND to denote essentially
the same thing (similarly for $\vee$ and OR). The difference is that $\wedge$ is a binary operator, whereas AND has
variable arity. Even if the evaluation of AND could be reduced, by associativity, to a composition of $\wedge$s, from
the viewpoint of the computational effort things are quite different.

4. ("ordered non-overlapping conjunction operator") AND<, given input antichains $A_0, A_1,
\dots, A_{m-1}$, returns the set of minimal intervals among those spanned by the tuples
$(I_0, I_1, \dots, I_{m-1}) \in A_0 \times A_1 \times \dots \times A_{m-1}$ satisfying $I_{i-1} \ll I_i$ ($0 < i < m$).

5. ("low-pass operator") LOWPASS_k, given an input antichain $A$, returns the set of intervals
from $A$ not longer than $k$.

6. ("Brouwerian difference operator") Given two antichains $A$ (the minuend) and $B$ (the
subtrahend), the difference $A - B$ is the set of intervals $I \in A$ for which there is no
$J \in B$ such that $J \subseteq I$.

More informally, given input antichains $A_0, A_1, \dots, A_{m-1}$, the operator BLOCK builds
sequences of consecutive intervals, each of which is taken from a different antichain, in the 
given order. It can be used, for instance, to implement a phrase operator. The AND< 
operator is an ordered-AND operator that returns intervals spanned by intervals coming 
from the Ai, much like the AND operator. However, in the case of AND< the left extremes 
of the intervals must be nondecreasing, and the intervals must be nonoverlapping. This 
operator can be used, for instance, to search for terms that must appear in a specified order. 
LOWPASS_k restricts the result to intervals not longer than a given threshold, and can be easily
combined with AND or AND< to implement searches for terms that must not be too far 
apart, and possibly appear in a given order. Finally, the Brouwerian difference considers the 
interval in the subtrahend as "poison" and returns only those intervals in the minuend that 
are not poisoned by any interval in the subtrahend; this operator finds useful applications, for 
example, in the case of passage search if the poisoning intervals are taken to be small (possibly 
singleton) intervals around the passage separators (e.g., end-of-paragraph, end-of-sentence, 
etc.). 
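
As a reading aid, the four additional operators can be written as naive reference implementations. This is a sketch under assumptions that are ours, not the paper's: intervals are `(left, right)` tuples over integer positions, so two intervals are consecutive when the second starts right after the first ends, and the length of an interval is its number of elements. The function names are hypothetical, and the lazy algorithms of the paper are not reproduced here.

```python
from itertools import product

def minimal(intervals):
    """Keep only the intervals that contain no other interval of the set."""
    s = set(intervals)
    return sorted(i for i in s
                  if not any(j != i and i[0] <= j[0] and j[1] <= i[1] for j in s))

def before(i, j):
    """I << J: every element of I precedes every element of J."""
    return i[1] < j[0]

def block(*antichains):
    """Concatenations of consecutive intervals, one per antichain, in order."""
    return sorted((t[0][0], t[-1][1]) for t in product(*antichains)
                  if all(t[k - 1][1] + 1 == t[k][0] for k in range(1, len(t))))

def and_ordered(*antichains):
    """Minimal spans of tuples whose components appear in order, without overlap."""
    return minimal((t[0][0], t[-1][1]) for t in product(*antichains)
                   if all(before(t[k - 1], t[k]) for k in range(1, len(t))))

def lowpass(antichain, k):
    """Intervals containing at most k elements."""
    return [iv for iv in antichain if iv[1] - iv[0] + 1 <= k]

def difference(a, b):
    """Intervals of a that contain no interval of b."""
    return [i for i in a if not any(i[0] <= j[0] and j[1] <= i[1] for j in b)]

# Singleton intervals for some term positions of Figure 1.
pease = [(p, p) for p in (0, 3, 6, 31, 34)]
porridge = [(p, p) for p in (1, 4, 7, 32, 35)]
hot = [(p, p) for p in (2, 17, 33)]

print(block(pease, porridge, hot))               # [(0, 2), (31, 33)]
print(and_ordered(pease, hot))                   # [(0, 2), (6, 17), (31, 33)]
print(lowpass(and_ordered(pease, hot), 3))       # [(0, 2), (31, 33)]
print(difference(and_ordered(pease, hot), [(10, 10)]))  # [(0, 2), (31, 33)]
```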

Note that the natural lattice operators AND and OR cannot return the empty antichain 
when all their inputs are nonempty. This is not true of the above operators: for instance, 
BLOCK might fail to find a sequence of consecutive intervals even if all its inputs are 
nonempty. 

Finally, we remark that all intervals satisfying the definition of the BLOCK operator are 
minimal. Indeed, assume by contradiction that for two concatenations of minimal intervals 
we have $[\ell..r] \subset [\ell'..r']$ (which implies either $\ell' < \ell$ or $r < r'$). Assume that $\ell' < \ell$
(the case $r < r'$ is similar), and note that removing the first component interval from both
concatenations we still get intervals strictly containing one another. We iterate the process,
obtaining two intervals of $A_{m-1}$ strictly containing one another.

4 Lazy evaluation 

The main point of this paper is that algorithms for computing operators on antichains of
intervals should always be lazy and linear in the input intervals: if an algorithm is lazy, when
only a small number of intervals is needed (e.g., for presenting snippets) the computational 
cost is significantly reduced. Linearity in the input intervals is the best possible result for a 
lazy algorithm, as input must be read at some point. All algorithms described in this paper 
satisfy this property, albeit in the case of AND and OR there is also a logarithmic factor in 
the number of input antichains. 

Note that if the inverted index provides random-access lists of term positions, algorithms 
such as those proposed in [5] might be more appropriate for first-level operators (e.g., logical 

5 This operator satisfies the property that $A - B \le C$ iff $A \le B \vee C$; it is sometimes called pseudo-difference,
and its definition is dual to that of relative pseudo-complement [2]. 
operators computed directly on lists of term positions), as by accessing directly the term
positions they achieve complexity proportional to $ms \log n$, where $n$ is the overall number
of intervals in the input antichains, $m$ is the number of antichains, and $s$ is the number of
results. Nonetheless, as soon as one combines several operators, the advantage of a lazy
linear implementation is again evident. Moreover, $s$ is in principle bounded only by $n$, and
the estimate above hides the fact that the input antichains must have been computed, with
a time cost and space occupancy $\Omega(n)$.

The logarithmic factor in the number of antichains can be easily proved to be unavoidable 
for the OR operator in a model in which intervals can be handled just by comparing their 
extremes: 

Theorem 1 Every algorithm to compute OR that is only allowed to compare interval extremes
requires $\Omega(n \log n)$ comparisons for $n$ input intervals.

Proof. It is possible to sort $n$ distinct integers by computing the OR of $n$ antichains, each
made by just one singleton interval containing one of the integers to be sorted. The resulting
antichain is exactly the list of sorted integers. By an application of the $\Omega(n \log n)$ lower
bound for sorting in this model, we get to the result. ∎
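
The reduction used in the proof can be illustrated directly. In the sketch below (our own illustration, with a hypothetical function name), each integer becomes an antichain made of one singleton interval; since distinct singletons are pairwise incomparable, no interval is discarded and the OR reduces to an ordered merge, for which we use the standard `heapq.merge`.

```python
import heapq

def sort_via_or(values):
    """Sort distinct integers by taking the OR of singleton antichains."""
    antichains = [[(v, v)] for v in values]   # one singleton interval per antichain
    merged = heapq.merge(*antichains)          # OR of incomparable singletons = merge
    return [left for left, _ in merged]

print(sort_via_or([5, 1, 4]))   # [1, 4, 5]
```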

4.1 Optimal laziness 

The term "lazy" is usually quoted informally, in particular in the context of functional or 
declarative programming. In this paper we consider algorithms that access input antichains 
under the form of lists that return the corresponding intervals in their natural order. We want 
to define formally a notion of laziness that makes it possible to prove optimality
results rigorously. We restrict to algorithms that read their inputs from an array of lists. Each list is
accessible via a "next" function that returns the next element from the list, and when a list 
is empty it returns null. Analogously, each algorithm has a "next" function that returns the 
next output, and when the output is over it returns null. So such algorithms can be thought 
of as producing an output list. 

Given an algorithm $\mathcal{A}$ and an input $I$ (i.e., an array of lists), let us write $\rho_i^{\mathcal{A}}(I,p)$ for the
number of elements (including possibly null) read by $\mathcal{A}$ from the $i$-th list of the input array
$I$ when the $p$-th output is produced (sometimes, we will omit $\mathcal{A}$, $I$ or $p$ when they are clear
from the context); when writing $\rho_i^{\mathcal{A}}(I,p)$ we shall always assume that $0 \le i < m$ (where
$m$ is the number of input lists) and that the output of $\mathcal{A}$ on input $I$ contains at least $p$
intervals.

A first property that we would like our algorithms to feature is that there is no algorithm
that uses strictly fewer inputs:

Definition 1 Two algorithms are functionally equivalent iff they produce the same output
list when they are given the same input lists. An algorithm $\mathcal{A}$ is minimally lazy if, for every
functionally equivalent algorithm $\mathcal{B}$ such that

$$\rho_i^{\mathcal{B}}(I,p) \le \rho_i^{\mathcal{A}}(I,p)$$

for all $I$ in the set of inputs and all $p$, we actually have

$$\rho_i^{\mathcal{B}}(I,p) = \rho_i^{\mathcal{A}}(I,p).$$

In fact, for most of our algorithms we will be able to prove a more interesting property: 

Definition 2 An algorithm $\mathcal{A}$ is $k$-lazy iff for every functionally equivalent algorithm $\mathcal{B}$,
and for all inputs $I$, and all $i$ and $p$ we have

$$\rho_i^{\mathcal{A}}(I,p) \le \rho_i^{\mathcal{B}}(I,p) + k.$$

An algorithm is optimally lazy if it is $k$-lazy for some $k$, and there exists no functionally
equivalent $(k-1)$-lazy algorithm.

Optimally lazy algorithms advance their inputs as little as possible when emitting an output.
Minimally optimally lazy algorithms have the further property that no improvement can be
obtained on a particular input without getting a worse result on some other input. Note that
since by definition there are no $k$-lazy algorithms when $k$ is negative, a 0-lazy algorithm is
always minimally and optimally lazy.

There is a subtlety in Definitions 1 and 2 that is worth remarking. By requiring that the
parameter p is never greater than the number of intervals in the output, we are not considering 
how many elements are read from the input lists to emit the final null. In principle, this 
choice implies that even minimally optimally lazy algorithms may consume useless input 
elements to emit their final null. A more thorough analysis would be required to include 
also this case, but it would yield a further subdivision of the above taxonomy of optimality: 
indeed, for some problems we consider it is easy to show there is no null-optimal solution. 
We think that such an analysis would add little value to the present work, as behaving lazily 
on non-null outputs is a sufficiently strong property by itself. 
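
The quantity counted in the definitions above, namely how many elements (null included) an algorithm has read from each input list when its p-th output is produced, can be measured with a small harness. The sketch below is our own illustration: `CountingList`, `rho`, `lazy_copy` and `eager_copy` are hypothetical names, `None` plays the role of null, and algorithms are modelled as Python generators over the input lists.

```python
class CountingList:
    """Input list whose `next` returns None when exhausted and counts every read."""
    def __init__(self, items):
        self._it = iter(items)
        self.reads = 0

    def next(self):
        self.reads += 1
        return next(self._it, None)

def rho(algorithm, inputs, p):
    """Reads performed on each list when the p-th output (p >= 1) is produced."""
    lists = [CountingList(x) for x in inputs]
    out = algorithm(lists)
    for _ in range(p):
        next(out)          # assumes the output contains at least p elements
    return [l.reads for l in lists]

def lazy_copy(lists):
    """Outputs the elements of the first list, reading one element per output."""
    while (x := lists[0].next()) is not None:
        yield x

def eager_copy(lists):
    """Functionally equivalent to lazy_copy, but reads the whole list first."""
    buf = []
    while (x := lists[0].next()) is not None:
        buf.append(x)
    yield from buf

print(rho(lazy_copy, [[1, 2, 3]], 1))    # [1]
print(rho(eager_copy, [[1, 2, 3]], 1))   # [4]: three elements plus the final null
```

In this toy case `lazy_copy` reads exactly one element per output, while the functionally equivalent `eager_copy` reads an unbounded number of extra elements before its first output.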

5 General remarks 

In the description and in the proofs of our algorithms, we use interchangeably $A_i$ to denote the
i-th input antichain and the list returning its intervals in their natural order (and, ultimately, 
null). This ambiguity should cause no difficulty to the reader. 

To simplify the exposition, in the pseudocode we often test whether a list is empty. Of 
course, this is not allowed by our model, but in all such cases the following instruction 
retrieves the next interval from the same list. Thus, the test can be replaced by a call that 
retrieves the next interval and tests for null. Finally, we can assume that after the function 
"next" has returned null once, it will keep returning null thereafter: this behaviour can be 
obtained by using an extra flag that avoids entering the function altogether. 

In all our algorithms, we do not consider the case of inputs equal to the top of the lattice 
(the antichain formed by the empty interval). For all our operators, the top either determines 
entirely the output (e.g., OR) or it is irrelevant (e.g., AND). Analogously, we do not consider 
the case of inputs equal to the bottom of the lattice (the empty antichain), which can be 
handled by a test on the first input read. 

More generally, when proving optimal laziness, it is common to meet situations in which 
an initial check is necessary to rule out obvious outputs. The initial check can make the 
algorithm analysis more complicated, as its logic could be wildly different from the true 
algorithm behaviour. To simplify this kind of analysis, we prove the following metatheorem, 
which covers the cases just described; in the statement of the theorem, $\mathcal{A}$ represents the
algorithm performing the initial check, whereas $\mathcal{B}$ does the real job:

Theorem 2 Let $\mathcal{B}$ be an algorithm defined on a set of inputs $B$, and $\mathcal{A}$ be defined on a
larger set of inputs $A \supseteq B$, and such that

• on all inputs $I \in B$, $\mathcal{A}$ outputs a one-element list containing a special element, say $\bot$,
and

• for all $I \in B$ and all $i$, $\rho_i^{\mathcal{A}}(I,1) \le \rho_i^{\mathcal{B}}(I,1)$.

Then, there exists an algorithm, denoted by $\mathcal{A} \star \mathcal{B}$, such that

• $\mathcal{A} \star \mathcal{B}$ is functionally equivalent to $\mathcal{B}$ on $B$;

• $\mathcal{A} \star \mathcal{B}$ is functionally equivalent to $\mathcal{A}$ on $A \setminus B$;

• if $\mathcal{A}$ and $\mathcal{B}$ are (minimally) optimally lazy on $A \setminus B$ and $B$, respectively, then $\mathcal{A} \star \mathcal{B}$
is (minimally) optimally lazy on $A$.

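Proof. Algorithm $\mathcal{A} \star \mathcal{B}$ simulates algorithm $\mathcal{A}$ and caches the input read so far. If $\mathcal{A}$
emits any element different from $\bot$, the simulation goes on until $\mathcal{A}$ is done, without caching
the input any longer; otherwise, $\mathcal{A} \star \mathcal{B}$ starts executing $\mathcal{B}$ on the cached input and possibly
on the remaining part of the input until $\mathcal{B}$ is done.

It is immediate to check that $\mathcal{A} \star \mathcal{B}$ is indeed functionally equivalent to $\mathcal{A}$ and $\mathcal{B}$ on
$A \setminus B$ and $B$, respectively, and moreover

$$\rho_i^{\mathcal{A} \star \mathcal{B}}(I,p) = \begin{cases} \rho_i^{\mathcal{A}}(I,p) & \text{if } I \in A \setminus B, \\ \rho_i^{\mathcal{B}}(I,p) & \text{if } I \in B. \end{cases}$$

Suppose now that $\mathcal{A}$ is $a$-lazy and $\mathcal{B}$ is $b$-lazy for some minimal $a$ and $b$, and let $c =
\max\{a, b\}$. For every algorithm $\mathcal{C}$ that is functionally equivalent to $\mathcal{A} \star \mathcal{B}$, we have that
$\rho_i^{\mathcal{B}}(I,p) \le \rho_i^{\mathcal{C}}(I,p) + b$ for all $I \in B$, and $\rho_i^{\mathcal{A}}(I,p) \le \rho_i^{\mathcal{C}}(I,p) + a$ for all $I \in A \setminus B$. But then,
using the observation above, $\rho_i^{\mathcal{A} \star \mathcal{B}}(I,p) \le \rho_i^{\mathcal{C}}(I,p) + c$ for all $I \in A$, so $\mathcal{A} \star \mathcal{B}$ is $c$-lazy.

Suppose now that $\mathcal{C}$ is functionally equivalent to $\mathcal{A} \star \mathcal{B}$ but that it is $(c-1)$-lazy,
and assume that $c = b$ (the other case is analogous). Then, for all $I \in B$, $\rho_i^{\mathcal{C}}(I,p) \le
\rho_i^{\mathcal{A} \star \mathcal{B}}(I,p) + c - 1 = \rho_i^{\mathcal{B}}(I,p) + b - 1$; but since $\mathcal{C}$ is also functionally equivalent to $\mathcal{B}$ on $B$,
the latter inequality contradicts the minimality of $b$.

For minimal laziness, suppose that $\mathcal{C}$ is functionally equivalent to $\mathcal{A} \star \mathcal{B}$ and such that
$\rho_i^{\mathcal{C}}(I,p) \le \rho_i^{\mathcal{A} \star \mathcal{B}}(I,p)$ for all $I \in A$. In particular, this means that $\rho_i^{\mathcal{C}}(I,p) \le \rho_i^{\mathcal{A}}(I,p)$ for
all $I \in A \setminus B$, and $\rho_i^{\mathcal{C}}(I,p) \le \rho_i^{\mathcal{B}}(I,p)$ for all $I \in B$. The minimal laziness of $\mathcal{A}$ and $\mathcal{B}$
implies that $\rho_i^{\mathcal{C}}(I,p) = \rho_i^{\mathcal{A}}(I,p)$ for all $I \in A \setminus B$ and $\rho_i^{\mathcal{C}}(I,p) = \rho_i^{\mathcal{B}}(I,p)$ for all $I \in B$, hence
$\rho_i^{\mathcal{C}}(I,p) = \rho_i^{\mathcal{A} \star \mathcal{B}}(I,p)$ for all $I \in A$. ∎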

Incidentally, we observe that $\mathcal{A} \star \mathcal{B}$ requires in general more space than $\mathcal{A}$ or $\mathcal{B}$ because
of caching; nonetheless, in all our applications we will need to cache just one item per input
list.
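
The construction used in the proof (simulate the initial check while caching what it reads, then either continue or restart the main algorithm on the cached input) can be sketched as follows. This is only an illustration under our own assumptions: algorithms are modelled as Python generator functions over iterators rather than through the "next"/null interface of the paper, and `BOTTOM`, `CachingReader`, `star`, `check_nonempty` and `merge` are hypothetical names.

```python
from itertools import chain
import heapq

BOTTOM = object()   # stands for the special one-element output of the initial check

class CachingReader:
    """Iterator that remembers what has been read from it while caching is on."""
    def __init__(self, source):
        self._source = iter(source)
        self._cache = []
        self.caching = True

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._source)
        if self.caching:
            self._cache.append(item)
        return item

    def replay(self):
        """The cached prefix followed by the part of the input not yet read."""
        self.caching = False
        return chain(self._cache, self._source)

def star(alg_a, alg_b):
    """Combine an initial check alg_a with the main algorithm alg_b."""
    def combined(inputs):
        readers = [CachingReader(x) for x in inputs]
        out_a = iter(alg_a(readers))
        try:
            first = next(out_a)
        except StopIteration:
            return
        if first is BOTTOM:
            # The check passed: run alg_b on cached input plus the remainder.
            yield from alg_b([r.replay() for r in readers])
        else:
            # alg_a produces the real output: stop caching and let it finish.
            for r in readers:
                r.caching = False
            yield first
            yield from out_a
    return combined

def check_nonempty(lists):
    """Toy initial check: reads one item per list."""
    heads = [next(l, None) for l in lists]
    yield BOTTOM if all(h is not None for h in heads) else "some input is empty"

def merge(lists):
    """Toy main algorithm: ordered merge of the inputs."""
    yield from heapq.merge(*lists)

alg = star(check_nonempty, merge)
print(list(alg([[1, 4], [3]])))   # [1, 3, 4]
print(list(alg([[], [3]])))       # ['some input is empty']
```

In the toy example the check reads one item per list, which the merge would have read anyway to produce its first output, so the hypothesis of the theorem on the number of reads is satisfied and one cached item per list suffices.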

6 Algorithms based on indirect queues 

The algorithms we provide for AND and OR are inspired by the plane-sweeping technique 
used in [16] for their proximity algorithm, which is in its own right a variant of the standard
sorted-list merge. The algorithms are implemented using an indirect priority queue. 

An indirect priority queue Q is a data structure based on an array (called the reference 
array), which is managed outside the queue itself, and a priority order that compares items 
from the reference array. At each time, the queue contains a set of indices into the reference 
array (initially, a specified set, possibly empty). An array index x can be added to the queue 
calling the function enqueue(Q, x).

The index of the least item in the reference array with respect to the priority order can
be accessed by invoking the function topIndex(Q). The index of the least item with respect
to the priority order is also returned by dequeue(Q), which additionally removes the index
from Q. Analogously, top(Q) returns the least item in the reference array with respect to the
priority order.

Operation        Description
enqueue(Q, x)    inserts the item with index x in the queue
topIndex(Q)      returns the index of the top item
top(Q)           returns the top item
dequeue(Q)       returns the index of the top item and deletes it from the queue
change(Q)        signals that the top item has changed
size(Q)          returns the number of indices currently in the queue

Table 1: The operations available for an indirect priority queue.

The data structure assumes that the only item of the reference array that might change 
its value is the top item. Such a change must be communicated immediately to the queue by 
calling the function change(Q). Table 1 summarises the operations available on an indirect 
priority queue. 6

A trivial array-based implementation requires linear space (in the number of input lists) 
and has constant cost for all operations modifying the queue, whereas retrieving the top 
requires linear time. A better implementation uses a priority queue (e.g., based on a heap) 
with linear space and logarithmic time complexity for all operations modifying the queue. 
Sophisticated heaps with linear costs for several operations do not modify significantly the 
overall behaviour, as each time the queue is advanced the interval corresponding to the top 
index becomes greater: there are data structures that make it possible to decrease in constant 
time the top, but not increase it (otherwise we could sort in linear time by comparison). 

All algorithms based on indirect priority queues have time complexity $O(n \log m)$ if the
input is formed by $m$ antichains containing $n$ intervals overall, and use $O(m)$ space. This is
immediate, as all loops contain exactly one queue advancement. 
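
A heap-based indirect priority queue offering the operations of Table 1 might look as follows. This is a minimal sketch, not the MG4J implementation: the class name, the Python method names and the `key` parameter (a function mapping an item to a sortable priority) are assumptions of ours. Since only the top item may change, `change` just needs to sift the root down.

```python
class IndirectPriorityQueue:
    """Binary heap of indices into an external reference array."""

    def __init__(self, ref, key):
        self.ref = ref     # reference array, managed by the caller
        self.key = key     # maps an item of ref to a sortable priority
        self.heap = []     # indices into ref, heap-ordered by key(ref[index])

    def _less(self, a, b):
        return self.key(self.ref[self.heap[a]]) < self.key(self.ref[self.heap[b]])

    def _swap(self, a, b):
        self.heap[a], self.heap[b] = self.heap[b], self.heap[a]

    def _sift_up(self, pos):
        while pos > 0:
            parent = (pos - 1) // 2
            if not self._less(pos, parent):
                break
            self._swap(pos, parent)
            pos = parent

    def _sift_down(self, pos):
        n = len(self.heap)
        while True:
            child = 2 * pos + 1
            if child >= n:
                break
            if child + 1 < n and self._less(child + 1, child):
                child += 1
            if not self._less(child, pos):
                break
            self._swap(pos, child)
            pos = child

    def enqueue(self, x):
        self.heap.append(x)
        self._sift_up(len(self.heap) - 1)

    def top_index(self):
        return self.heap[0]

    def top(self):
        return self.ref[self.heap[0]]

    def dequeue(self):
        top = self.heap[0]
        last = self.heap.pop()
        if self.heap:
            self.heap[0] = last
            self._sift_down(0)
        return top

    def change(self):
        # Only the top item may have changed: restore the heap from the root.
        self._sift_down(0)

    def size(self):
        return len(self.heap)

ref = [5, 1, 3]
q = IndirectPriorityQueue(ref, key=lambda v: v)
for i in range(len(ref)):
    q.enqueue(i)
print(q.top_index())   # 1
ref[1] = 4             # the top item changes in the reference array...
q.change()             # ...and the queue is notified
print(q.top_index())   # 2
```

Every operation that modifies the queue costs logarithmic time in the number of indices, and the space used is linear in the size of the reference array.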

6.1 Basic comparators 

Our algorithms will be based on two priority orders. The first one, denoted by $\leq$, is defined
by

$$[\ell..r] \leq [\ell'..r'] \iff r < r' \text{ or } (r = r' \text{ and } \ell \geq \ell').$$

In other words, $[\ell..r] \leq [\ell'..r']$ if $[\ell..r]$ ends before or is a suffix of $[\ell'..r']$. Note in
particular that (somewhat counterintuitively) $[\ell..r] \leq [\ell'..r]$ iff $\ell \geq \ell'$.
The second order, denoted by $\preceq$, is defined by

$$[\ell..r] \preceq [\ell'..r'] \iff \ell < \ell' \text{ or } (\ell = \ell' \text{ and } r \geq r').$$

In other words, $[\ell..r] \preceq [\ell'..r']$ if $[\ell..r]$ starts before or prolongs $[\ell'..r']$. Note in particular
that $[\ell..r] \preceq [\ell..r']$ iff $r \geq r'$, and that the following implication holds:

$$[\ell..r] \subseteq [\ell'..r'] \implies [\ell..r] \leq [\ell'..r'] \text{ and } [\ell'..r'] \preceq [\ell..r].$$
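
The two orders can be expressed as sort keys, which is convenient when the queue is backed by a standard heap. The sketch below assumes intervals are `(left, right)` tuples of numbers; the function names are illustrative and not taken from the paper.

```python
def key_leq(interval):
    """Key for the first order: by right extreme, ties broken by larger left extreme."""
    left, right = interval
    return (right, -left)


def key_preceq(interval):
    """Key for the second order: by left extreme, ties broken by larger right extreme."""
    left, right = interval
    return (left, -right)


def contained(a, b):
    """True if interval a is contained in interval b."""
    return b[0] <= a[0] and a[1] <= b[1]
```

With these keys, `key_leq(a) <= key_leq(b)` holds exactly when `a` precedes or equals `b` in the first order, and similarly for `key_preceq`; containment of `a` in `b` implies `key_leq(a) <= key_leq(b)` and `key_preceq(b) <= key_preceq(a)`, mirroring the implication above.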

The algorithms for AND/OR use an indirect priority queue with priority order $\preceq$ or $\leq$.
The reference array underlying the queue contains one interval per input antichain. In the 
initialisation phase, the reference array is filled with the first interval from each antichain, 
and the queue contains all indices. 

6 Actually, a more appropriate name would be semi-indirect queue: an indirect queue has a change operation that restores the correct state after a change in the value associated to any index.

To simplify the description, we define a procedure advance(Q) that replaces the interval associated with the top index in the reference array with the next interval from the corresponding list, and notifies the queue of the change. If the update cannot be performed because the list is empty, the top index is dequeued. The function is described in pseudocode in Algorithm 1.

Algorithm 1 The advance function.

procedure advance(Q) begin
    i ← topIndex(Q);
    if A_i is not empty then
        [ℓ_i..r_i] ← next(A_i);
        change(Q)
    else
        dequeue(Q)
    end;
end;



6.2 The OR operator 

We start with the simplest nontrivial operator. To compute the OR of the antichains $A_0$, $A_1$, $\dots$, $A_{m-1}$, we merge them using an indirect priority queue $Q$ with priority order $\leq$.

We keep track of the last interval $c$ returned (initially, $c = [-\infty..-\infty]$). When we want to compute the next interval, we advance $Q$ as long as the top interval contains $c$, and then if the queue is not empty we return the top. The algorithm⁷ is described in pseudocode in Algorithm 2.

Theorem 3 Algorithm 2 for OR is correct. 

Proof. First of all, note that all intervals in $A_0$, $A_1$, $\dots$, $A_{m-1}$ are assigned to $c$ at some point, unless they contain a previously returned interval. Thus, we just have to prove that only minimal intervals are returned.

Let $[\ell..r]$ be a non-minimal element of $A_0 \cup A_1 \cup \dots \cup A_{m-1}$, and $[\ell'..r']$ the largest (according to $\leq$) minimal interval contained in $[\ell..r]$. After returning $[\ell'..r']$ (which certainly appears at the top of the queue before $[\ell..r]$ due to the fact that $\subseteq$ implies $\leq$), all intervals in the queue have a right extreme larger than or equal to $r'$. When we advance the queue, and until we get past $[\ell..r]$, the top interval will always contain $[\ell'..r']$, for otherwise there would be a minimal interval with right extreme between $r'$ and $r$, and $[\ell'..r']$ would not be largest. Thus, the while loop will remove at some point $[\ell..r]$.

To prove that all returned intervals are unique, we just have to note that when $I$ is returned, all other copies of $I$ are in the reference array. Thus, at the next call the while loop will be repeated until all remaining copies are discarded. ∎

Theorem 4 Algorithm 2 for OR is minimally and optimally lazy. 

Proof. We show that the algorithm (let us call it $\mathcal{A}$) is 0-lazy. The first output of the algorithm requires reading exactly one interval from each list. No correct algorithm can emit the first output without this data.

7 Note that this algorithm, as discussed in Section 8, can be derived from the dominance algorithms presented in [12].

Algorithm 2 The algorithm for the OR operator. Note that the second part of the while condition is actually equivalent to "left(top(Q)) ≤ left(c)" due to the monotonicity of the top-interval right extreme.

Initially c ← [−∞..−∞] and Q contains one interval from each A_i.

function next begin
    while Q is not empty and c ⊆ top(Q) do
        advance(Q)
    end;
    if Q is empty then return null;
    c ← top(Q);
    return c
end;



Suppose now that for an algorithm $\mathcal{A}^*$ it happens that

$$\rho_i^{\mathcal{A}^*}(I,p) < \rho_i^{\mathcal{A}}(I,p)$$

for some input $I$ and some $i$ and $p$. Since upon returning the $p$-th output $[\ell..r]$ the reference array contains the least interval (w.r.t. $\leq$) after $[\ell..r]$ from each list, this means that $\mathcal{A}^*$ emits $[\ell..r]$ having read from the $i$-th input list an interval $[\ell'..r']$ strictly smaller than $[\ell..r]$ according to $\leq$; this means that either $r' < r$, or $r' = r$ and $\ell < \ell'$, but the latter case is ruled out by minimality of $[\ell..r]$. Thus, $r' < r$, and $\mathcal{A}^*$ would return an incorrect output if the $i$-th input list returned $[s..s]$ as next input, with $r' < s < r$. ∎

Note that for the last proof the genericity of the underlying order is essential: if we knew 
that there are no elements between r' and r we could not obtain the contradiction. 
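
The OR procedure described in this section can be sketched in Python as a generator. This is an illustration under stated assumptions rather than the paper's code: intervals are `(left, right)` tuples of finite numbers, each input antichain is an iterable sorted by increasing extremes, the function name `or_intervals` is hypothetical, and a `heapq` heap keyed by `(right, -left)` plays the role of the indirect priority queue with order $\leq$.

```python
import heapq


def or_intervals(antichains):
    """Lazily yield the minimal intervals among all intervals of the inputs."""
    iters = [iter(a) for a in antichains]
    heap = []  # entries: (right, -left, list index), i.e. the first priority order
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[1], -first[0], i))
    c = (float("-inf"), float("-inf"))  # last interval returned
    while heap:
        right, neg_left, i = heap[0]
        top = (-neg_left, right)
        if top[0] <= c[0] and c[1] <= top[1]:
            # The top interval contains c: discard it by advancing its list.
            nxt = next(iters[i], None)
            if nxt is None:
                heapq.heappop(heap)
            else:
                heapq.heapreplace(heap, (nxt[1], -nxt[0], i))
        else:
            c = top
            yield c
```

Because the function is a generator, no input list is advanced until the caller asks for the next output, which mirrors the lazy behaviour discussed above.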

6.3 The AND operator 

The AND operator is much more subtle. The priority order of $Q$ is $\preceq$, and additionally the queue keeps track of the largest right extreme of intervals in the reference array, which we will call the right extreme of $Q$ (we just need a variable that is maximised with the right extreme of each new input interval, as at the first dequeueing we shall return null). We say that $Q$ is full if it contains exactly $m$ indices.

At any time, the interval spanned by $Q$ is the interval defined by the left extreme of the top interval and the right extreme of $Q$: it will be denoted by span(Q). Clearly, it is the minimum interval containing all intervals currently in the queue.

We keep track of the last interval $c$ returned (initially, $c = [-\infty..-\infty]$). When we want to compute the next interval, we first advance $Q$ until the spanned interval does not contain $c$, and in case $Q$ is no longer full we return null. Then, we store the interval $[\ell..r]$ currently spanned by $Q$ as a candidate and advance $Q$. If the new interval spanned by $Q$ is included in $[\ell..r]$ we repeat the operation, updating the candidate. Otherwise (or if $Q$ is no longer full) we just return the candidate. The algorithm is described in pseudocode in Algorithm 3.
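
The description above can be sketched in Python as follows. This is only an illustration: intervals are assumed to be `(left, right)` tuples of finite numbers, inputs are iterables sorted by increasing extremes, the name `and_intervals` is hypothetical, and the heap breaks ties between equal intervals by list index (one of several possible choices).

```python
import heapq


def and_intervals(antichains):
    """Lazily yield the minimal intervals containing one interval from every input."""
    m = len(antichains)
    iters = [iter(a) for a in antichains]
    heap = []                    # entries: (left, -right, list index), the second order
    q_right = float("-inf")      # right extreme of Q: largest right extreme seen so far
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is None:
            return               # an empty operand makes the result empty
        heapq.heappush(heap, (first[0], -first[1], i))
        q_right = max(q_right, first[1])
    if m == 0:
        return

    def advance():
        nonlocal q_right
        i = heap[0][2]
        nxt = next(iters[i], None)
        if nxt is None:
            heapq.heappop(heap)  # the queue is no longer full
        else:
            heapq.heapreplace(heap, (nxt[0], -nxt[1], i))
            q_right = max(q_right, nxt[1])

    def span():
        return (heap[0][0], q_right)

    def contains(outer, inner):
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    c = (float("-inf"), float("-inf"))  # last interval returned
    while True:
        while len(heap) == m and contains(span(), c):
            advance()
        if len(heap) < m:
            return
        while True:
            c = span()
            if c == (heap[0][0], -heap[0][1]):
                break            # the candidate is the top interval itself
            advance()
            if not (len(heap) == m and contains(c, span())):
                break
        yield c
```

For example, on the inputs `[(0, 0), (5, 5)]` and `[(1, 1), (4, 4)]` the candidate `(1, 5)` is replaced by `(4, 5)` before being returned, because the span shrinks while its right extreme stays the same.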

Algorithm 3 The algorithm for the AND operator. Note that the second part of the first while condition can be substituted with "left(top(Q)) = left(c)" because of monotonicity of the largest right extreme, and that the second part of the second while condition can be substituted with "right(c) = right(Q)" by monotonicity of the top-interval left extreme.

Initially c ← [−∞..−∞] and Q contains one interval from every A_i.

 1  function next begin
 2      while Q is full and c ⊆ span(Q) do
 3          advance(Q)
 4      end;
 5      if Q is not full then return null;
 6      do
 7          c ← span(Q);
 8          if c = top(Q) then return c;
 9          advance(Q)
10      while Q is full and span(Q) ⊆ c;
11      return c
12  end;

Theorem 5 Algorithm 3 for AND is correct.

Proof. We say that a queue configuration is complete if it contains all copies of the top interval from all lists that contain it. Now observe that every complete configuration of an indirect priority queue is entirely defined by its top interval. More precisely, if the top is an interval $I$ from list $i$, then for every other list $j$ the corresponding interval $J$ in the queue is the minimum interval in $A_j$ larger than or equal to $I$ (according to $\preceq$). Indeed, suppose by contradiction that there is another interval $K$ from $A_j$ satisfying

$$I \preceq K \prec J.$$

Then, at some point $K$ must have entered the queue, and when it has been dequeued the top must have become some interval $I' \preceq I$, so we get

$$K \preceq I' \preceq I \preceq K,$$

which yields $K = I$: a contradiction, as we assumed the configuration of the queue to be complete.

We now show that for every minimal interval $[\ell..r]$ in the AND of $A_0$, $A_1$, $\dots$, $A_{m-1}$ there is a complete configuration of $Q$ spanning $[\ell..r]$. Consider for each $i$ the set $C_i$ of intervals of $A_i$ contained in $[\ell..r]$. At least one of these sets must contain a (necessarily unique) right delimiter, that is, an interval of the form $[\ell'..r]$ (see Figure 2). Moreover, at least one of the sets containing a delimiter must be a singleton. Indeed, if every $C_i$ containing a right delimiter also contained some other interval, the right extreme of that interval would clearly be smaller than $r$: the maximum of such right extremes, say $r' < r$, would define a spanned interval $[\ell..r']$ showing that $[\ell..r]$ was not minimal. We conclude that at least one $C_i$, say $C_{\bar\imath}$, is a singleton containing a right delimiter.

Let $I_i$ be the leftmost interval in each $C_i$; these intervals are a complete configuration of $Q$: if $I_i = [\ell..r']$ is the $\preceq$-smallest among such intervals and if $I_i \in A_j$, necessarily $I_i = I_j$, because $A_j$ cannot contain two intervals with the same left extreme. The set of intervals also spans $[\ell..r]$ (because the right extreme of $I_{\bar\imath}$ is $r$, and the left extreme of the $\preceq$-least interval $I_i$ is $\ell$). We conclude that all minimal intervals in the output are eventually spanned by $Q$.

However, no minimal interval can be spanned during the first while loop, unless it has been already returned, as all intervals spanned in that loop contain a previously returned interval (notice that at the first call the loop is skipped altogether). Finally, if an interval is spanned in the second while loop and we do not get out of the loop, the next candidate interval will be smaller or equal. We conclude that sooner or later all minimal intervals cause an interruption of the second while loop, and are thus returned.

Figure 2: A sample configuration found in the proofs of Theorems 5 and 6. The dashed intervals are right delimiters. The first two input lists are in the inner set; the last two input lists are in the conflict set; the last input list is also in the resolution set.

We are left to prove that if an interval is returned, it is necessarily minimal. If we exit the loop using the check on the top interval, the returned interval is necessarily minimal. Otherwise, assume that the interval $[\ell..r]$ spanned by $Q$ at the start of the second while loop is not minimal, so $[\bar\ell..r] \subset [\ell..r]$ for some minimal interval $[\bar\ell..r]$ that will be necessarily spanned later (as we already proved that all minimal intervals are returned). Since the right extreme of $Q$ is nondecreasing, the second while loop will pass through intervals of the form $[\ell'..r]$, with $\ell < \ell' \leq \bar\ell$, until we exit the loop.

Finally, we remark that the uniqueness of all returned intervals is guaranteed by the first while loop. ∎

Note that our algorithm for AND cannot be 0-lazy, because the choices made by the queue for equal intervals cause different behaviours. For instance, on the input lists $\{[0..0], [2..2]\}$, $\{[1..1]\}$, $\{[0..0], [2..2]\}$ the algorithm advances the last list before returning $[0..1]$, but there is a variant of the same algorithm that keeps intervals sorted lexicographically by $\preceq$ and by input list index, and this variant would advance the first list instead.

Nonetheless: 

Theorem 6 Algorithm 3 for AND is minimally and optimally lazy. 

Proof. We denote Algorithm 3 with $\mathcal{A}$, and let $\mathcal{A}^*$ be a functionally equivalent algorithm. Let us number the intervals appearing in a certain input $I = A_0, A_1, \dots, A_{m-1}$: in particular, let $[\ell_i^j..r_i^j]$ be the $j$-th interval appearing in $A_i$. For the sake of simplicity, let us identify the null returned as last element by the input lists with the interval $[\infty..\infty]$ (it is immediate to see that $\mathcal{A}$ behaves identically). Let us write $\rho_i$ (respectively, $\rho_i^*$) for $\rho_i^{\mathcal{A}}(I,p)$ (respectively, $\rho_i^{\mathcal{A}^*}(I,p)$), and let $[\ell..r]$ be the $p$-th output interval; let also $s_i$ be the index of the first interval in list $A_i$ that is included in $[\ell..r]$.

We divide the indices of the input lists in two sets: the inner set is the set of indices $i$ for which $\ell < \ell_i^{s_i}$ (that is, the first interval of $A_i$ included in $[\ell..r]$ has left extreme larger than $\ell$); the conflict set is the set of indices $i$ for which $\ell = \ell_i^{s_i}$ (that is, the first interval of $A_i$ included in $[\ell..r]$ has left extreme equal to $\ell$). Finally, the resolution set is a subset of the conflict set containing those indices $i$ for which $r_i^{s_i+1} > r$ (that is, the successor of the first interval of $A_i$ included in $[\ell..r]$ is no longer contained in $[\ell..r]$). Note that the resolution set is always nonempty, or otherwise $[\ell..r]$ would not be minimal (recall that we substituted null with $[\infty..\infty]$). The situation is depicted in Figure 2.

We remark the following facts: 

(i) for all $i$, $\rho_i^* \geq s_i$; that is, when $\mathcal{A}^*$ outputs $[\ell..r]$ it has read at least the first interval of the antichain with left extreme larger than or equal to $\ell$; otherwise, $\mathcal{A}^*$ would emit $[\ell..r]$ even on a modified input in which $A_i$ has no intervals contained in $[\ell..r]$ (such intervals have index equal to or greater than $s_i$, so they have not been seen by $\mathcal{A}^*$ yet);

(ii) for all $i$ in the inner set, $\rho_i = s_i \leq \rho_i^*$;

(iii) for all $i$ in the conflict set, $\rho_i \in \{s_i, s_i + 1\}$; that is, in the case an antichain does contain an interval $J$ with left extreme $\ell$, either the last interval read by $\mathcal{A}$ when $[\ell..r]$ is output is exactly $J$, or it is the interval just after $J$;

(iv) if for some $i$ we have $[\ell_i^{s_i}..r_i^{s_i}] = [\ell..r]$, then $\rho_j = s_j$ for all $j$, because we exit the second while loop at line 8;

(v) otherwise, there is a unique index $\bar\imath$ in the resolution set such that $\rho_{\bar\imath} = s_{\bar\imath} + 1$ (i.e., $r_{\bar\imath}^{\rho_{\bar\imath}} > r$), and for all other resolution indices $i$ we have $\rho_i = s_i$ (i.e., $r_i^{\rho_i} \leq r$); this happens because we interrupt the second while loop when we see the first interval whose right extreme exceeds $r$ (at line 10).

Let us first prove that $\mathcal{A}$ is 1-lazy by showing that $\rho_i \leq \rho_i^* + 1$: this is true for all indices in the inner set because of (ii), and for all indices in the conflict set because of $\rho_i \leq s_i + 1 \leq \rho_i^* + 1$ (by (iii) and (i)).

Now, let us show that $\mathcal{A}^*$ cannot be 0-lazy. Suppose it is such; then, in particular, $\rho_i^* \leq \rho_i$ for all indices $i$, and we can assume w.l.o.g. that $\rho_i^* < \rho_i$ for some $i$ (if for all inputs, all output prefixes and all $i$ we had $\rho_i^* = \rho_i$, then we would conclude that $\mathcal{A}$ is 0-lazy as well, contradicting the observation made before this theorem).

Note that we can also assume w.l.o.g. not to be in case (iv) (as in that case $\rho_i = \rho_i^*$ for all $i$), which also implies that $\ell \neq r$. Thus, the unique index $\bar\imath$ of (v) is also the only index in the resolution set such that $\rho_{\bar\imath}^* = s_{\bar\imath} + 1$ ($\mathcal{A}^*$ must advance some list in the resolution set, or it would emit a wrong output on a modified input in which the $(s_i + 1)$-th interval of $A_i$ is $[r..r]$ for all $i$ in the conflict set).

Let $i_0, i_1, \dots, i_{t-1}$ be the indices in the conflict set for which $\rho_{i_p} = s_{i_p} + 1$, in the order in which they are accessed from the corresponding lists by $\mathcal{A}$: clearly $i_{t-1} = \bar\imath$ is the only resolution index in this sequence, by (v). Let $j_0, j_1, \dots, j_{u-1}$ be the indices in the conflict set for which $\rho_{j_p}^* = s_{j_p} + 1$, in the order in which they are accessed from the corresponding lists by $\mathcal{A}^*$. Necessarily, $\{j_0, j_1, \dots, j_{u-1}\} \subseteq \{i_0, i_1, \dots, i_{t-1}\}$ (because $s_{j_p} + 1 = \rho_{j_p}^* \leq \rho_{j_p} \leq s_{j_p} + 1$, hence $\rho_{j_p} = s_{j_p} + 1$) and inclusion is strict (because, for some index $i$, $\rho_i^* < \rho_i$, hence $s_i \leq \rho_i^* < \rho_i \leq s_i + 1$, which implies that $i = i_v$ for some $v$, whereas $i \neq j_v$ for all $v$).

Let $p$ be the first position that $\mathcal{A}$ and $\mathcal{A}^*$ choose differently, that is, $i_p \neq j_p$ (this happens at least at the position of $j_0, j_1, \dots, j_{u-1}$ where $\bar\imath$ appears). We build a new input similar to $A_0, A_1, \dots, A_{m-1}$, except for $A_{i_p}$ and $A_{j_p}$, which are identical up to their interval of left extreme $\ell$; then, $A_{i_p}$ continues with $[r'..r']$ for some $r' > r$ (so $i_p$ is in the resolution set), whereas $A_{j_p}$ continues with $[r..r]$ (so $j_p$ is in the inner set). On this input, to output $[\ell..r]$ $\mathcal{A}$ advances the input list $A_{j_p}$ strictly less than $\mathcal{A}^*$, which contradicts the assumption on $\mathcal{A}^*$. ∎


7 Greedy algorithms 

The remaining operators admit greedy algorithms: they advance the input lists in a specified order until some condition becomes true. The case of LOWPASS$_k$ is of course trivial, and the algorithm for BLOCK is essentially a restatement in terms of intervals of the folklore algorithm for phrasal queries. They are both optimally and minimally lazy. The cases of AND< and Brouwerian difference are more interesting: AND< is the only algorithm for which we prove the impossibility of an optimally or minimally lazy implementation in the general case.

All greedy algorithms have time complexity $O(n)$ if the input is formed by $m$ antichains containing $n$ intervals overall, and use $O(m)$ space. This is immediate, as all loops advance at least one input list.

7.1 The BLOCK operator 

The BLOCK operator is the only one that can be implemented only if the underlying total order is discrete, that is, if it admits a notion of successor. In discussing this algorithm, we shall assume that every element $x \in O$ has a successor, denoted by $x + 1$, satisfying $x < x + 1$ and $x \leq y \leq x + 1 \implies x = y$ or $y = x + 1$.

We keep track of a current interval for all lists $A_0$, $A_1$, $\dots$, $A_{m-1}$; initially, these intervals are set to $[-\infty..-\infty]$. When we want to compute the next interval, we update the interval associated to the first list. Then, we try to fix index $i$ (initially, $i = 1$). To do so, we advance the list $A_i$ until the returned interval has left extreme larger than the right extreme of the current interval for $A_{i-1}$. If we go too far, we just advance the first list, reset $i$ to 1 and restart the process, otherwise we increment $i$. When we find an interval for $A_{m-1}$ we return the interval spanned by all current intervals. The algorithm is described in pseudocode in Algorithm 4.
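
The greedy procedure just described might be sketched in Python as below. The sketch assumes integer extremes (so that the successor of `x` is `x + 1`), intervals as `(left, right)` tuples, and inputs sorted by increasing extremes; the name `block_intervals` is illustrative.

```python
def block_intervals(antichains):
    """Lazily yield the concatenations of consecutive intervals, one per input."""
    m = len(antichains)
    if m == 0:
        return
    iters = [iter(a) for a in antichains]
    neg_inf = float("-inf")
    cur = [(neg_inf, neg_inf)] * m          # current interval of every list
    while True:
        nxt = next(iters[0], None)          # update the interval of the first list
        if nxt is None:
            return
        cur[0] = nxt
        i = 1
        while i < m:
            # Advance list i until its interval starts after the previous one ends.
            while cur[i][0] <= cur[i - 1][1]:
                nxt = next(iters[i], None)
                if nxt is None:
                    return
                cur[i] = nxt
            if cur[i][0] == cur[i - 1][1] + 1:
                i += 1                      # interval i is adjacent: fix it
            else:
                nxt = next(iters[0], None)  # we went too far: restart from the first list
                if nxt is None:
                    return
                cur[0] = nxt
                i = 1
        yield (cur[0][0], cur[m - 1][1])
```

With singleton intervals standing for term positions, this behaves like a phrase matcher: for the positions `[0, 5, 9]` of a first term and `[1, 7, 10]` of a second term it reports the two adjacent occurrences.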

Algorithm 4 The algorithm for the BLOCK operator.

Initially [ℓ_k..r_k] ← [−∞..−∞] for all 0 ≤ k < m.

 1  function next begin
 2      if A_0 is empty then return null;
 3      [ℓ_0..r_0] ← next(A_0);
 4      i ← 1;
 5      while i < m do
 6          while ℓ_i ≤ r_{i−1} do
 7              if A_i is empty then return null;
 8              [ℓ_i..r_i] ← next(A_i)
 9          end;
10          if ℓ_i = r_{i−1} + 1 then i ← i + 1
11          else begin
12              if A_0 is empty then return null;
13              [ℓ_0..r_0] ← next(A_0);
14              i ← 1
15          end
16      end;
17      return [ℓ_0..r_{m−1}]
18  end;

Theorem 7 Algorithm 4 for BLOCK is correct.

Proof. At the start of an iteration of the external while loop (line 5) with a certain index $i$ we clearly have $r_k + 1 = \ell_{k+1}$ for $k = 0, 1, \dots, i - 2$. Thus, if we complete the execution of the loop we certainly return a correct interval.

To complete the proof, we start by proving the following invariant property: at line 5, for all $0 < j < m$ there are no intervals in $A_j$ with left extreme in $[r_{j-1} + 1..\ell_j - 1]$. In other words, the $j$-th current interval $[\ell_j..r_j]$ has either left extreme smaller than or equal to $r_{j-1}$, or it is the first interval in $A_j$ whose left extreme is larger than $r_{j-1}$. The property is trivially true at the beginning, and advancing $[\ell_0..r_0]$ cannot change this fact. We are left to prove that the execution of the internal while loop (line 6) cannot either.

During the execution of the loop at line 6, only $[\ell_i..r_i]$ can change. This affects the invariant because it modifies the intervals $[r_{i-1} + 1..\ell_i - 1]$ and $[r_i + 1..\ell_{i+1} - 1]$, but in the second case the interval is made smaller, so the invariant is a fortiori true. In the first case, at the beginning of the execution of the internal while loop either $r_{i-1} + 1 \leq \ell_i - 1$, which implies $r_{i-1} < \ell_i$, so the loop is not executed at all and the invariant cannot change, or $r_{i-1} + 1 > \ell_i - 1$, which means that the interval $[r_{i-1} + 1..\ell_i - 1]$ is empty, and the loop will advance $[\ell_i..r_i]$ up to the first interval in $A_i$ with a left extreme larger than $r_{i-1}$, making again the invariant true.

Suppose now that there are $[\bar\ell_0..\bar r_0]$, $[\bar\ell_1..\bar r_1]$, $\dots$, $[\bar\ell_k..\bar r_k]$ satisfying $\bar r_i + 1 = \bar\ell_{i+1}$ for some $k \geq 0$ and $0 \leq i < k$. We prove by induction on $k$ that at some point during the execution of the algorithm we will be at the start of the external while loop with $i = k$ and $[\ell_j..r_j] = [\bar\ell_j..\bar r_j]$ for $j = 0, 1, \dots, k$. The thesis is trivially true for $k = 0$. Assume the thesis for $k - 1$, so we are at the start of the external while loop with $i = k - 1$ and $\ell_j = \bar\ell_j$, $r_j = \bar r_j$ for $j = 0, 1, \dots, k - 1$. Because of the invariant, either $[\ell_k..r_k] = [\bar\ell_k..\bar r_k]$ or $[\ell_k..r_k]$ will be advanced by the execution of the internal while loop up to $[\bar\ell_k..\bar r_k]$. Thus, at the end of the external while loop the thesis will be true for $k$. We conclude that all concatenations of intervals from $A_0$, $A_1$, $\dots$, $A_{m-1}$ are returned.

We note that all intervals returned are unique (minimality has been already discussed in Section 3), as $[\ell_0..r_0]$ is advanced at each call, so a duplicate returned interval would imply the existence of two comparable intervals in $A_0$. ∎

Theorem 8 Algorithm 4 for BLOCK is minimally and optimally lazy. 

Proof. The algorithm is trivially 0-lazy, as all outputs are uniquely determined by a tuple of intervals from the inputs. An algorithm $\mathcal{A}^*$ advancing an input list $A_i$ less than Algorithm 4 for some output $[\ell..r]$ would emit $[\ell..r]$ even if we truncated $A_i$ after the last interval read by $\mathcal{A}^*$. ∎

Practical remarks. In the case of intervals of integers, the advancement of the first list at the end of the outer loop can actually be iterated until $r_0 \geq \ell_i - i$. This change does not affect the complexity of the algorithm, but it may reduce the number of iterations of the outer loop. In case the input antichains are entirely formed by singletons⁸, a folklore algorithm aligns the singletons circularly rather than starting from the first one (since they are singletons, once the position of an interval is fixed all the remaining ones are, too). The main advantage is that of avoiding resolving several alignments if the first few terms often appear consecutively, but not followed by the remaining ones.

7.2 The AND< operator 

The algorithm for computing this operator is a medley of the algorithms for AND and for BLOCK: as in the case of AND, we must check that future intervals are not smaller than our current candidate $[\ell'..r']$; as in the case of BLOCK, there is no queue and the lists $A_0$, $A_1$, $\dots$, $A_{m-1}$ are advanced greedily. Again, we keep track of a current interval $[\ell_i..r_i]$ for every list $A_i$; initially, these intervals are $[-\infty..-\infty]$, except for the first one, which is taken from the first list. The algorithm is described in pseudocode in Algorithm 5; an informal description follows.



Algorithm 5 The algorithm for the AND< operator. For the sake of simplicity, we use the convention that returning [∞..∞] means returning null, and that if one of the input lists is exhausted the function returns null.

Initially [ℓ_0..r_0] ← next(A_0), [ℓ_k..r_k] ← [−∞..−∞] for all 0 < k < m and i ← 1.

 1  function next begin
 2      [ℓ'..r'] ← [∞..∞];
 3      b ← ∞;
 4      forever
 5          forever
 6              if r_{i−1} ≥ b then return [ℓ'..r'];
 7              if i = m or ℓ_i > r_{i−1} then break;
 8              do
 9                  if r_i ≥ b or A_i is empty then return [ℓ'..r'];
10                  [ℓ_i..r_i] ← next(A_i)
11              while ℓ_i ≤ r_{i−1};
12              i ← i + 1
13          end;
14          [ℓ'..r'] ← [ℓ_0..r_{m−1}];
15          b ← ℓ_{m−1};
16          i ← 1;
17          if A_0 is empty then return [ℓ'..r'];
18          [ℓ_0..r_0] ← next(A_0)
19      end;
20  end;



The core of the algorithm is in the loop starting at line 8: this loop tries to align the $i$-th interval, that is, advance it until $[\ell_{i-1}..r_{i-1}] \ll [\ell_i..r_i]$. The loop starting at line 5 aims at aligning all intervals; note that we assume as an invariant that, after the first execution, every time we discover that the $i$-th interval is already aligned we can conclude that the remaining intervals (the ones with index larger than $i$) are aligned as well (second condition at line 7).

8 We emphasize this case because this is what happens with phrasal queries all of whose subqueries are simple terms; implementations may treat this special case differently to obtain further optimization, for instance using ad hoc indices [17].

The loop at line 5 can be interrupted as soon as, trying to align the $i$-th interval, we exhaust the $i$-th list or find an interval whose right extreme is not smaller than $b$, the left extreme of the $(m-1)$-th interval forming the current candidate alignment. If any such condition is satisfied, the current candidate is certainly minimal and can thus be returned.

Upon a successful alignment (line 14), we have a new candidate: note that either this is the first candidate (i.e., $[\ell'..r'] = [\infty..\infty]$ before the assignment), or its right extreme coincides with the one of the previous candidate (i.e., $r' = r_{m-1}$ before the assignment), whereas its left extreme is certainly strictly larger. In either case, we try to see if we can advance the first interval and find a new, smaller candidate with a new alignment: this should explain the outer loop.
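
A possible Python rendering of the informal description above is sketched here. Several choices are assumptions of the sketch rather than statements from the paper: intervals are `(left, right)` tuples of finite numbers, inputs are iterables sorted by increasing extremes, the names `ordered_and_intervals` and `step` are hypothetical, the index `i` is reset to 1 after each successful alignment, and an explicit `exhausted` flag implements the convention that the function returns null once an input list is exhausted.

```python
INF = float("inf")


def ordered_and_intervals(antichains):
    """Lazily yield the minimal intervals spanned by aligned sequences of inputs."""
    m = len(antichains)
    iters = [iter(a) for a in antichains]
    first = next(iters[0], None) if m > 0 else None
    if first is None:
        return
    cur = [first] + [(-INF, -INF)] * (m - 1)  # current interval of every list
    i = 1                                      # list being aligned; kept across calls

    def step():
        """One call of next: returns (candidate or None, whether an input ran out)."""
        nonlocal i
        cand, b = None, INF
        while True:
            while True:
                if cur[i - 1][1] >= b:                    # candidate is minimal
                    return cand, False
                if i == m or cur[i][0] > cur[i - 1][1]:   # everything is aligned
                    break
                while True:                               # align the i-th interval
                    if cur[i][1] >= b:
                        return cand, False
                    nxt = next(iters[i], None)
                    if nxt is None:
                        return cand, True
                    cur[i] = nxt
                    if cur[i][0] > cur[i - 1][1]:
                        break
                i += 1
            # Successful alignment: record the new candidate and its bound.
            cand, b = (cur[0][0], cur[m - 1][1]), cur[m - 1][0]
            i = 1
            nxt = next(iters[0], None)                    # try to shrink from the left
            if nxt is None:
                return cand, True
            cur[0] = nxt

    while True:
        cand, exhausted = step()
        if cand is not None:
            yield cand
        if exhausted:
            return
```

On the inputs `[(0, 0), (1, 1)]` and `[(2, 2)]` the first alignment gives the candidate `(0, 2)`, which is then replaced by the smaller `(1, 2)` after the first list is advanced; only the latter is emitted.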

Theorem 9 Algorithm 5 for AND< is correct. 

Proof. Let us say that a sequence $[\ell'_h..r'_h] \ll [\ell'_{h+1}..r'_{h+1}] \ll \dots \ll [\ell'_{k-1}..r'_{k-1}]$ of intervals ($h < k \leq m$), one from each list $A_h$, $A_{h+1}$, $\dots$, $A_{k-1}$, is leftmost if, for all $h < j < k$, there are no intervals in $A_j$ with left extreme in $(r'_{j-1}..\ell'_j)$: such a sequence is uniquely determined by $k$ and by $[\ell'_h..r'_h]$. Let $[\bar\ell..\bar r]$ be the interval returned at the last call (initially, $[\bar\ell..\bar r] = [\infty..\infty]$). Then, the following invariant holds at the start of the loop at line 5:

1. $[\ell_0..r_0] \ll [\ell_1..r_1] \ll \dots \ll [\ell_{i-1}..r_{i-1}]$ is leftmost;

2. if $\ell_i \neq -\infty$ also $[\ell_i..r_i] \ll [\ell_{i+1}..r_{i+1}] \ll \dots \ll [\ell_{m-1}..r_{m-1}]$ is leftmost;

3. if $[\ell_{i-1}..r_{i-1}] \ll [\ell_i..r_i]$ then this pair is leftmost.

The fact that this invariant holds is easy to check; in particular, see the inner while loop at line 8 and the exit at line 7.

We now show that each output interval $[\bar\ell..\bar r]$ is at some time assigned to $[\ell'..r']$. Note that $i > 0$ at all times, so $[\ell_0..r_0]$ is assigned only at the end of the infinite loop. This means that $[\ell_0..r_0]$ runs through the whole first input list.

Thus, as soon as $\ell_0 = \bar\ell$ the inner loop will either compute the leftmost representation of $[\bar\ell..\bar r]$, or exit prematurely. In the second case, the function will necessarily complete the leftmost representation at the next call. We conclude that leftmost representations of all output intervals are assigned to $[\ell'..r']$ eventually: since $[\bar\ell..\bar r]$ is minimal, it will be emitted before $[\ell'..r']$ is assigned again. Uniqueness follows by uniqueness of leftmost representations. ∎

It is not difficult to see that there is no algorithm for AND< that is $k$-lazy for any $k$, except for the case $m = 2$; indeed:

Theorem 10 If $m > 2$, there exists no optimally lazy algorithm for AND<.

Proof. By contradiction, let $\mathcal{B}$ be $k$-lazy, and observe that, on any given input $I$, every algorithm for AND<, before emitting its $p$-th output $[\ell..r]$, must have reached at least the leftmost sequence $[\ell'_0..r'_0] \ll [\ell'_1..r'_1] \ll \dots \ll [\ell'_{m-1}..r'_{m-1}]$ spanning it. Now, choose any $x \in (r'_{m-2}..\ell'_{m-1})$ and, for all $i = 0, \dots, m - 2$, take an arbitrary sequence $u_i^0 < u_i^1 < u_i^2 < \dots < u_i^{k+1}$ in $(r'_i..\min\{\ell'_{i+1}, x\})$; also choose an arbitrary sequence $v^0 < v^1 < v^2 < \dots < v^{k+1}$ in $(x..\ell'_{m-1})$. Run $\mathcal{B}$ on a different input $J$, obtained as follows: whenever $\mathcal{B}$ asks for an input from list $i < m - 1$, we use the original intervals from $I$ only up to $[\ell'_i..r'_i]$, and then we do the following: if $i < m - 1$, we start offering $[u_i^0..v^0]$, $[u_i^1..v^1]$ and so on; as far as the last input list is concerned, we do not make any change. An example of this construction is given in Figure 3.

Figure 3: A sample configuration found in the proof of Theorem 10. In this case, $m = 4$ and $k = 1$. The dashed intervals are those of the form $[u_i^j..v^j]$: while reading such intervals it is impossible to decide whether the continuous intervals span an element of the output.

Note that the intervals are chosen so that $\mathcal{B}$ cannot yet emit $[\ell..r]$, because there is always some chance for it not to be minimal. We stop testing $\mathcal{B}$ as soon as $\mathcal{B}$ has read at least $k + 2$ inputs after $[\ell'_i..r'_i]$ from the $i$-th list for some $i < m - 1$; let $J'$ be the portion of $J$ read by $\mathcal{B}$ so far, let $j$ be any index different from $i$ and from $m - 1$, and let $\mathcal{A}$ be an algorithm for AND< obtained from $\mathcal{B}$ by modifying its behavior on the input as follows: when faced with an input that coincides with $I$ up to $[\ell'_0..r'_0] \ll [\ell'_1..r'_1] \ll \dots \ll [\ell'_{m-1}..r'_{m-1}]$ inclusive, it then reads one more interval for each list and, if all these intervals contain any common point, say $z$, it starts reading from list $j$ until an interval not including $z$ is reached, or until the $j$-th list ends, in which case it emits $[\ell..r]$. Note that this modification does not harm the correctness of the algorithm, but now $\rho_i^{\mathcal{A}}(J',p) + k + 1 = \rho_i^{\mathcal{B}}(J',p)$, which contradicts the $k$-laziness of $\mathcal{B}$. ∎

Hence, for AND<, there is no hope for our algorithm to be optimally lazy in the general case; yet, it enjoys three interesting properties:

Theorem 11 Let $\mathcal{A}$ be Algorithm 5 for AND<.

1. $\mathcal{A}$ is minimally lazy;

2. $\mathcal{A}$ is minimally and optimally lazy when $m = 2$;

3. for any functionally equivalent algorithm $\mathcal{B}$, $\rho_i^{\mathcal{A}}(I,p) \leq \rho_i^{\mathcal{B}}(I,p+1)$; that is, our algorithm, to produce any output, never reads more input than $\mathcal{B}$ needs to produce its next output.

Proof. (1) Suppose that $\mathcal{B}$ is functionally equivalent to $\mathcal{A}$ and $\rho_j^{\mathcal{B}}(I,p) \leq \rho_j^{\mathcal{A}}(I,p)$ for every $j$, $I$ and $p$, and $\rho_{\bar\jmath}^{\mathcal{B}}(I,p) < \rho_{\bar\jmath}^{\mathcal{A}}(I,p)$ for some specific $\bar\jmath$, $I$ and $p$. Let $[\ell..r]$ be the $p$-th output on input $I$, and $[\ell'_0..r'_0] \ll \dots \ll [\ell'_{m-1}..r'_{m-1}]$ be its leftmost spanning sequence (Figure 4 displays an example); when $\mathcal{A}$ outputs $[\ell..r]$, we have that $[\ell_j..r_j] = [\ell'_j..r'_j]$ for all $j > i$, whereas the $i$-th list is over or is such that $r_i \geq \ell'_{m-1}$ (with leftmost $r_i$), $[\ell_0..r_0] \ll \dots \ll [\ell_{i-1}..r_{i-1}]$ is leftmost and $\ell < \ell_0$. Since no correct algorithm can emit $[\ell..r]$ before scanning its input up to the leftmost spanning sequence, necessarily $\bar\jmath \leq i$.

Figure 4: A sample configuration found in the proof of Theorem 11. The intervals $[\ell'_i..r'_i]$ form a leftmost spanning sequence, and $i = 1$, so $\bar\jmath = 0$. Note that no algorithm can avoid reading $[\ell_1..r_1]$, or it would fail if we replaced it with the dotted interval.

Moreover, necessarily $\bar\jmath \neq i$: otherwise, we could modify the inputs by substituting the unread intervals of the lists $A_i$, $A_{i+1}$, $\dots$, $A_{m-2}$ with a suitable sequence of aligned intervals which, together with the remaining ones, would span $[\ell..r]$; this would make $[\ell..r]$ non-minimal.

Now, suppose that $J$ is an input equal to $I$ but modified so that the $\bar\jmath$-th list ends immediately after the last interval read by $\mathcal{B}$: on input $J$, algorithm $\mathcal{A}$ does not read a single interval from list $i$ beyond $[\ell'_i..r'_i]$, because it emits $[\ell..r]$ as soon as the test for emptiness of $A_{\bar\jmath}$ is performed. So $\rho_i^{\mathcal{A}}(J,p) < \rho_i^{\mathcal{B}}(J,p)$, a contradiction.

(2) We prove that $\mathcal{A}$ is 0-lazy in that case. Indeed, when a certain output $[\ell..r]$ is ready to be produced, $\mathcal{A}$ tries to read one more interval $[\ell_0..r_0]$ from the first list, and this is unavoidable (any other algorithm must do this, or otherwise we might modify the next interval so that $[\ell..r]$ is not minimal). This interval has a right extreme larger than or equal to $\ell_1$, or otherwise $[\ell..r]$ would not be minimal: $\mathcal{A}$ exits at this point, so it is 0-lazy.

(3) This is trivial: when $\mathcal{A}$ outputs an interval, it has not yet reached (or, it has just reached) the leftmost sequence spanning the following output, and no correct algorithm could ever emit the next output before that point. ∎

Practical remarks. In the case of intervals of integers, the check for r_i ≥ b can be replaced
by r_i ≥ b − (m − i − 2), and the check for r_{i−1} ≥ b by r_{i−1} ≥ b − (m − i − 1), obtaining in some
cases faster detection of minimality. If the input antichains are entirely formed by singletons,
the check r_i ≥ b can be removed altogether, as in that case we know that r_i = ℓ_i ≤ r_{i−1} < b.
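
One way to read the integer bound, reading the check as r_i ≥ b − (m − i − 2), is the following short derivation. It is only a sketch under two assumptions that are not restated in this passage: that [ℓ_k..r_k] ≪ [ℓ_{k+1}..r_{k+1}] means r_k < ℓ_{k+1}, and that the unshifted check is the one applied to r_{m−2}. With integer extremes, each step of an aligned sequence moves the right extreme forward by at least one, so the shifted check on r_i already forces the unshifted check further down the sequence.

```latex
% Assumption: [\ell_k..r_k] \ll [\ell_{k+1}..r_{k+1}] means r_k < \ell_{k+1},
% and all extremes are integers.
\begin{align*}
r_{k+1} \;\ge\; \ell_{k+1} \;\ge\; r_k + 1 \quad (i \le k < m-2)
  &\;\Longrightarrow\; r_{m-2} \;\ge\; r_i + (m - i - 2), \\
r_i \;\ge\; b - (m - i - 2)
  &\;\Longrightarrow\; r_{m-2} \;\ge\; b .
\end{align*}
```

Substituting i − 1 for i in the first bound gives b − (m − i − 1), the shift used for the check on r_{i−1}.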

7.3 Brouwerian difference 

The Brouwerian difference M − S between antichains M (the minuend) and S (the subtrahend)
can be computed by searching greedily, for each interval [ℓ..r] in M, the first interval
[ℓ'..r'] in S for which ℓ' ≥ ℓ or r' ≥ r. We keep track of the last interval [ℓ'..r'] read from
the input list S (initially, [ℓ'..r'] = [−∞..−∞]) and update it until ℓ' ≥ ℓ or r' ≥ r. At
that point, if we did not exhaust S and [ℓ'..r'] ⊆ [ℓ..r] (in which case [ℓ..r] should not be
output) we continue scanning M; otherwise, we return [ℓ..r]. The algorithm is described in
pseudocode in Algorithm 6.



Algorithm 6 The algorithm for Brouwerian difference.

Initially [ℓ'..r'] ← [−∞..−∞].

 1 function next begin
 2   while M is not empty do
 3     [ℓ..r] ← next(M);
 4     while ℓ' < ℓ and r' < r and S is not empty do
 5       [ℓ'..r'] ← next(S)
 6     end;
 7     if S is empty or [ℓ'..r'] ⊈ [ℓ..r] then return [ℓ..r]
 8   end;
 9   return null
10 end;
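
The scan performed by Algorithm 6 can be sketched as a Python generator. This is a minimal sketch under stated assumptions, not the authors' implementation: the name `brouwerian_difference`, the representation of intervals as `(left, right)` tuples, and the use of Python iterators in place of next(M) and next(S) are all choices made here for illustration. Both inputs are assumed to be antichains listed by increasing left extreme. The sketch decides the output by the containment test alone: if the scan runs off the end of S while the left extreme of the last interval read is still smaller than that of the current minuend interval, that interval cannot be contained in the minuend interval anyway.

```python
import math
from typing import Iterable, Iterator, Tuple

Interval = Tuple[float, float]  # (left, right), both extremes included


def brouwerian_difference(minuend: Iterable[Interval],
                          subtrahend: Iterable[Interval]) -> Iterator[Interval]:
    """Lazily yield the intervals of `minuend` containing no interval of `subtrahend`.

    Assumption: both inputs are antichains sorted by left extreme
    (and therefore also by right extreme).
    """
    s = iter(subtrahend)
    sl, sr = -math.inf, -math.inf  # last interval read from S, initially [-inf..-inf]
    exhausted = False
    for l, r in minuend:           # one interval of M is read per candidate output
        # Advance S until its current interval has sl >= l or sr >= r.
        while sl < l and sr < r and not exhausted:
            nxt = next(s, None)
            if nxt is None:
                exhausted = True
            else:
                sl, sr = nxt
        # Emit [l..r] unless the interval of S found above is contained in it.
        if not (l <= sl and sr <= r):
            yield (l, r)


M = [(0, 3), (2, 5), (4, 9), (8, 10)]
S = [(1, 2), (5, 6), (9, 12)]
print(list(brouwerian_difference(M, S)))  # → [(2, 5), (8, 10)]
```

Because the function is a generator, no interval of either list is read until an output is requested, which mirrors the sequential, on-demand access pattern the pseudocode relies on.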



Theorem 12 Algorithm 6 for Brouwerian difference is correct. 

Proof. Note that at the start of the inner while loop (line 4) [ℓ'..r'] contains either the
leftmost interval of S such that ℓ' ≥ ℓ or r' ≥ r, or some interval preceding it. This is
certainly true at the first call, and remains true after the execution of the inner while loop
because of the first part of its exit condition (line 4). Finally, advancing the list of M cannot
make the invariant false.

Given the invariant, at the end of the inner loop [ℓ'..r'] contains the leftmost interval of
S such that ℓ' ≥ ℓ or r' ≥ r, if such an interval exists. Note that if [ℓ'..r'] is not contained
in [ℓ..r], then no other interval of S is. Indeed, if ℓ' < ℓ this means that r' ≥ r, so all
preceding intervals have too small left extremes, and all following intervals have too large
right extremes (the same happens a fortiori if ℓ' ≥ ℓ). Thus, the test at line 7 will emit
[ℓ..r] if and only if it belongs to the output. ∎

Theorem 13 Algorithm 6 for Brouwerian difference is minimally and optimally lazy. 

Proof. The algorithm (let us call it 𝒜) is trivially 0-lazy: when [ℓ..r] is output, 𝒜 has
read just [ℓ..r] from M and the first element [ℓ'..r'] of S such that ℓ' ≥ ℓ or r' ≥ r. If
either interval has not been read by some other algorithm 𝒜*, then 𝒜* would fail if we removed
[ℓ..r] from M altogether, or if we substituted [ℓ'..r'] with [ℓ..r] and deleted all following
intervals in S. ∎



8 Previous work 

The only attempt at linear lazy algorithms for minimal-interval region algebras we are aware
of is the work of Young-Lai and Tompa on structure selection queries [18], a special type of
expressions built on the primitives "contained-in", "overlaps", and so on, that can be evaluated
lazily in linear time. Their motivations are similar to ours: the application of region algebras
to very large text collections. Similarly, Navarro and Baeza-Yates [13] propose a class of
algorithms that use tree traversals to compute efficiently several operations on
overlapping regions. Their motivation is the efficient implementation of structured query
languages that permit such regions. Albeit similar in spirit, these works do not provide algorithms for
any of the operators we consider, and they do not provide a formal proof of laziness.

The manipulation of antichains of intervals can be translated into the manipulation of points in
the plane compared by dominance (coordinatewise ordering). Indeed, [ℓ..r] ⊇ [ℓ'..r'] iff the
point (ℓ, −r) is dominated by the point (ℓ', −r'). Dominance problems have been studied for
a long time in computational geometry: for instance, [12] presents an algorithm to compute
the maximal elements w.r.t. dominance. This method can be turned into an algorithm for
antichains of intervals by coupling it with a simple (right-extreme based) merge to produce
an algorithm for the OR operator. One has just to notice that since dominance is symmetric
in the extremes, the mapping [ℓ..r] ↦ (−r, ℓ) turns minimal intervals (by containment)
into maximal points (by dominance). The algorithm described in [12] assumes a decreasing
first-coordinate order of the points, which however is an increasing ordering by right extreme
on the original intervals. After some cleanup, the algorithm turns out to be identical to our
algorithm for OR (albeit the authors do not study its laziness).
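
The correspondence between minimal intervals and maximal points can be checked on a small example. The sketch below is only a brute-force, quadratic illustration of the mapping from an interval to the point with coordinates (minus right extreme, left extreme); it is not the algorithm of [12] nor the OR algorithm of this paper, and the function names `minimal_intervals`, `to_point`, `dominated` and `maximal_points` are hypothetical names introduced here.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]   # (left, right)
Point = Tuple[int, int]


def minimal_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Intervals that contain no other (distinct) interval of the set."""
    s = set(intervals)
    return sorted(i for i in s
                  if not any(j != i and i[0] <= j[0] and j[1] <= i[1] for j in s))


def to_point(interval: Interval) -> Point:
    """Map [l..r] to the point (-r, l)."""
    l, r = interval
    return (-r, l)


def dominated(p: Point, q: Point) -> bool:
    """True if p is dominated by q, i.e., p <= q in both coordinates."""
    return p[0] <= q[0] and p[1] <= q[1]


def maximal_points(points: Iterable[Point]) -> List[Point]:
    """Points that are not dominated by any other (distinct) point of the set."""
    s = set(points)
    return sorted(p for p in s if not any(q != p and dominated(p, q) for q in s))


intervals = [(0, 5), (1, 3), (2, 6), (4, 4), (3, 7)]
print(minimal_intervals(intervals))                     # → [(1, 3), (4, 4)]
print(maximal_points(to_point(i) for i in intervals))   # → [(-4, 4), (-3, 1)]
```

On this input the maximal points are exactly the images of the minimal intervals, as the mapping predicts.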

The other operators have no significant geometric meaning, and to the best of our knowledge
there is no algorithm in computational geometry that computes them.

Lazy evaluation is a by-now classical topic in the theory of computation, dating back to
the mid-70s [8], originally introduced for expressing the semantics of call-by-need in
functional languages. However, the notion of lazy optimality used in this paper is new, and we
believe that it captures as precisely as possible the idea of optimality in accessing sequentially
multiple lists of inputs in a lazy fashion.

9 Conclusions 

We have provided efficient algorithms for the computation of several operators on the lattice 
of interval antichains. The algorithms for lattice operations require time O(n log m) for m
input antichains containing n intervals overall, whereas the remaining algorithms are linear in 
n. In particular, the algorithm for OR has been proved to be optimal in a comparison-based 
model. Moreover, the algorithms are minimally and optimally lazy (with the exception of 
AND< when m > 2, in which case we prove an impossibility result) and use space linear in 
the number of input antichains. Our algorithms compare favourably with previously known 
techniques [5], which in particular required random access to the inputs. 

An interesting open problem is that of providing a matching lower bound for the AND 
operator, at least for a comparison-based computational model. 